AITopics | model failure

Collaborating Authors

model failure

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models

Neural Information Processing SystemsFeb-18-2026, 04:41:38 GMT

The predominant de facto paradigm of testing ML models relies on either using only held-out data to compute aggregate evaluation metrics or by assessing the performance on different subgroups.

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > Wales (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > United Kingdom > Scotland (0.04)
(3 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Education (1.00)
Health & Medicine > Consumer Health (0.92)
Health & Medicine > Diagnostic Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

Add feedback

6ea56c0baacac9f7764257a43a93c90a-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsFeb-15-2026, 16:57:06 GMT

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
North America > Dominican Republic (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Consumer Products & Services (0.68)
Transportation (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

Automated Classification of Model Errors on ImageNet

Neural Information Processing SystemsFeb-14-2026, 12:42:05 GMT

While the ImageNet dataset has been driving computer vision research over the past decade, significant label noise and ambiguity have made top-1 accuracy an insufficient measure of further progress.

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > United Kingdom > England > Staffordshire (0.04)
Asia > China (0.04)
(6 more...)

Genre: Research Report (0.68)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)
Leisure & Entertainment > Sports (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

ac27b77292582bc293a51055bfc994ee-AuthorFeedback.pdf

Neural Information Processing SystemsFeb-13-2026, 13:16:00 GMT

metamer, model metamer, optimization, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.49)

Add feedback

Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models

Neural Information Processing SystemsOct-10-2025, 16:48:54 GMT

The predominant de facto paradigm of testing ML models relies on either using only held-out data to compute aggregate evaluation metrics or by assessing the performance on different subgroups.

dataset, hypothesis, model failure, (16 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > Wales (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > United Kingdom > Scotland (0.04)
(3 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Education (1.00)
Health & Medicine > Consumer Health (0.92)
Health & Medicine > Diagnostic Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

Add feedback

6ea56c0baacac9f7764257a43a93c90a-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsOct-10-2025, 05:32:53 GMT

arxiv preprint arxiv, benchmark, reasoning, (13 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
North America > Dominican Republic (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Consumer Products & Services (0.68)
Transportation (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

7480ed13740773505262791131c12b89-Paper-Conference.pdf

Neural Information Processing SystemsOct-8-2025, 22:02:32 GMT

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > United Kingdom > England > Staffordshire (0.04)
Asia > China (0.04)
(6 more...)

Genre: Research Report (0.68)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)
Leisure & Entertainment > Sports (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Add feedback

R1 had questions about the criterion for deciding that a

Neural Information Processing SystemsAug-19-2025, 23:11:43 GMT

We will include another sentence in the discussion explicitly addressing the relationship to [41].

artificial intelligence, metamer, model metamer, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.49)

Add feedback

A Semantic-based Optimization Approach for Repairing LLMs: Case Study on Code Generation

Gu, Jian, Aleti, Aldeida, Chen, Chunyang, Zhang, Hongyu

arXiv.org Artificial IntelligenceMar-17-2025

Language Models (LMs) are widely used in software engineering for code generation, but they may produce code with errors. Rather than repairing the generated code, an alternative way is to address the underlying failures of models. LM repair offers a lightweight solution to this challenge: it requires minimal data, reduces computational costs, and reduces the side effects. Unlike retraining, LM repair focuses on applying tailored updates to targeted neurons, making it ideal for scenarios with limited resources, high-performance demands, or strict safety requirements. In this paper, we propose \ul{S}emantic \ul{T}argeting for \ul{A}nalytical \ul{R}epair (\textsc{STAR}), a pioneering and novel semantic-based optimization approach for repairing LLMs. \textsc{STAR} realizes main operations in LM repair methods in an optimization process, including locating ``buggy neurons'', solving ``neuron patches'', and patching ``buggy neurons''. Correspondingly, it computes the deltas of weight matrix as the prior information to guide optimization; and attributes the targeted layers and neurons leveraging statistical insights. The neuron patches are computed with a solid semantic-based analytical formula, which directly bridges the changes to logits with the deltas of neurons, by steering latent representations. Compared to the prior work of LM repair (\textsc{MINT}) and optimization methods (\textsc{SGD}), \textsc{STAR} integrates their strengths while mitigating their limitations. \textsc{STAR} supports solving multiple failures together, significantly improving the usefulness. Evaluated on three code generation tasks using popular code LMs, \textsc{STAR} demonstrates superior effectiveness. Additionally, \textsc{STAR} exhibits better efficiency. In terms of side effects, namely the balance between generalization and specificity, \textsc{STAR} outperforms prior work by a significant margin.

artificial intelligence, natural language, optimization problem, (19 more...)

arXiv.org Artificial Intelligence

2503.12899

Country:

North America > United States > District of Columbia > Washington (0.05)
Asia > China > Chongqing Province > Chongqing (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

Do Large Language Model Benchmarks Test Reliability?

Vendrow, Joshua, Vendrow, Edward, Beery, Sara, Madry, Aleksander

arXiv.org Artificial IntelligenceFeb-5-2025

When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also reliable. Many benchmarks have been created to track LLMs' growing capabilities, however there has been no similar focus on measuring their reliability. To understand the potential ramifications of this gap, we investigate how well current benchmarks quantify model reliability. We find that pervasive label errors can compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior. Motivated by this gap in the evaluation of reliability, we then propose the concept of so-called platinum benchmarks, i.e., benchmarks carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures further reveals previously unidentified patterns of problems on which frontier models consistently struggle. We provide code at https://github.com/MadryLab/platinum-benchmarks

benchmark, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2502.03461

Country: